Abstract

Background: Semantic Web Technology (SWT) makes it possible to integrate and search the large volume of life 
science datasets in the public domain, as demonstrated by well-known linked data projects such as LODD, 
Bio2RDF, and Chem2Bio2RDF. Integration of these sets creates large networks of information. We have previously 
described a tool called WENDI for aggregating information pertaining to new chemical compounds, effectively 
creating evidence paths relating the compounds to genes, diseases and so on. In this paper we examine the utility 
of automatically inferring new compound-disease associations (and thus new links in the network) based on 
semantically marked-up versions of these evidence paths, rule-sets and inference engines. 

Results: Through the implementation of a semantic inference algorithm, rule set, Semantic Web methods (RDF, 
OWL and SPARQL) and new interfaces, we have created a new tool called Chemogenomic Explorer that uses 
networks of ontologically annotated RDF statements along with deductive reasoning tools to infer new 
associations between the query structure and genes and diseases from WENDI results. The tool then permits 
interactive clustering and filtering of these evidence paths. 

Conclusions: We present a new aggregate approach to inferring links between chemical compounds and diseases 
using semantic inference. This approach allows multiple evidence paths between compounds and diseases to be 
identified using a rule-set and semantically annotated data, and for these evidence paths to be clustered to show 
overall evidence linking the compound to a disease. We believe this is a powerful approach, because it allows 
compound-disease relationships to be ranked by the amount of evidence supporting them. 



Background 

Recent advances in the chemical and biological sciences have
led to a rapid increase in the volume of information
about known chemical compounds, genes, diseases,
and assays. Statistics from the PubChem Substance
database of chemical structures show an increase from
35,379,748 structures in 2007 to 69,088,100 in 2010; the
number of PubChem Bioassays increased from around
1000 in 2008 to 434,635 in 2010 [1], and there are
726,872 compound records and 2,925,588 activities in
the chemogenomic ChEMBL [2] dataset. Numerous
other chemical, chemogenomic, and biological data sources
(including data extracted from the scholarly literature)
are also available, including ChEBI [3], CTD [4], KEGG
[5] and Medline [6], inter alia. Several well-known search
engines have been developed for these data resources.
PubChem, for example, provides chemical structure search
and bioassay search, and returns extensive chemical and
bioactivity information based on PubChem Bioassay data.
ChemSpider [7] links together compound information across
the Web and provides free text and structure search
access to millions of chemical structures. It offers multiple
search modes for chemical information drawn from
hundreds of data vendors.

We can imagine all these information resources as
buckets for pieces of a very large jigsaw puzzle, each
bucket containing only pieces of a certain color. To
assemble the full picture we need to be able to search
and apply algorithms that span different buckets
seamlessly. Many technologies are useful for this,
most recently those from the Semantic Web Technology
(SWT) community. XML (for describing data),
OWL (for describing ontologies and taxonomies), and
RDF (for describing relationships) allow data
aggregation and the representation of meaning and
relationships in the data, and are now quite widely
applied to life science data. LODD [8], Bio2RDF [9], and
Chem2Bio2RDF [10] demonstrate not only how SWT
can enable integration of multiple sources, but also how
it supports complex query processing using the SPARQL
language. Our resource, Chem2Bio2RDF, integrates six
categories of data based on the kinds of biological and
chemical concepts and relationships they represent:
chemical & drug, protein & gene, chemogenomics, systems
(i.e., PPI and pathway), phenotype (i.e., disease and side
effect) and literature. However, the current version of
Chem2Bio2RDF lacks a formal ontology, so it is hard for
users to read and understand the meaning of the metadata
and harder still to do further inference. Once an integrated
network of compounds, genes, diseases, etc. is in place
with an appropriate ontology (PharmGKB [11], for example,
establishes knowledge about the relationships among
drugs, diseases and genes, including their variations and
gene products), it becomes possible to semantically infer
new links in the network (i.e. identify new associations)
via sets of rules and inference engines that use these
rules. For example, we might have a rule that if a
chemical compound A is highly similar to a drug D that is
known to be active against a protein target T, we infer
an association (and thus a network link) between A and
T (possibly annotated with a confidence value).

Semantic inference has been used in various applications,
including knowledge-based recommender systems [12]
and human-machine communication [13], but there
have been few applications in the life sciences.
Neurocommons [14] uses SWT for assembling and querying
biomedical knowledge from multiple sources and
disciplines. With this system, scientists will be able to
load in lists of genes that come off the lab robots, and
get back those lists of genes with relevant information
around them based on public knowledge [15].
SciNetS Search [16] is an inference search over integrated
life science databases using SWT. It supports
cross-domain search and statistical scoring. All the
metadata of the databases are described as a set of triples
consisting of two bio-items and the relationships between
these items. GoRouter [17] builds an RDF model for
semantic query and inference, but the inference is
restricted to the Gene Ontology and its related
associations.

In our previous paper [18], we introduced a novel
tool, WENDI (Web Engine for Non-Obvious Drug
Information), for aggregating information related to a
compound in order to identify relationships. WENDI probes
the potential biological properties of the compound using
predictive models, databases, and the scholarly literature,
in particular to find non-obvious relationships between
the compound and assays, genes, and diseases that
cut across different types of data sources. The purpose
of WENDI is not just to return data about a compound
(as a database search would); rather, it allows a
researcher to understand the context in which a compound
operates, and to find clues to properties of the compound
that they might not otherwise have discovered. WENDI
integrates data for a particular query compound and
represents its result graph in XML. The WENDI architecture
is shown in Figure 1.

WENDI performs well on data integration, but
it relies on the user to manually find associations among the
kinds of results presented. We have therefore
extended WENDI to use semantic inference and
rules to automatically infer new associations from the
WENDI XML results. In aggregate, these new associations
form clusters that build evidence of an
association between compounds and diseases via multiple
sources or evidence paths. We have implemented this in
a tool called Chemogenomic Explorer (CE) that uses networks
of ontologically-enabled RDF statements (e.g. the query
compound C is similar to compound D, drug D is active
in assay A, assay A is associated with gene G) along with
deductive reasoning tools to infer relationships between
the query compound and genes and diseases. This
allows us to cluster insights by disease, and then prioritize
the output based on the amount of evidence linking a
compound to a disease.

Methods 

The WENDI web service is used to create an initial set
of relational paths in XML. CE adds to the previously
reported capabilities of WENDI through (i) the application
of an inference engine and rule set to enable new
associations to be inferred; (ii) clustering and filtering of
inferred evidence paths in a completely new interface;
and (iii) the application of Semantic Web languages and
methods (RDF, SPARQL, OWL) to enable a much
broader range of capabilities, including creation and
mining of evidence paths and the annotation of
relationships using the ontology. These new methods are
described below.

WENDI XML includes the direct relationships
between similar compounds and bioassays, similar
compounds and literature references, bioassays and genes/
diseases, and so on. The process of importing this
information into CE has four steps: (1) data preparation,
described under Data preparation; (2) semantic representation
using a CE ontology and presentation in RDF format,
described under Data representation; (3) rule-based inference,
described under Inference and the Rule Base; and (4) path
ranking based on the number of properties for each disease,
described under Path Ranking.

Figure 1 WENDI Architecture. The main WENDI web interface is shown in the upper right corner.



Data preparation 

WENDI aggregates information from diverse data
sources and predictive models including PubChem
Compound, PubChem Bioassay, PubChem3D [19],
DrugBank [20], MRTD [21], CTD [4], ChEMBL [2],
HuGEpedia [22], KEGG [5], and Medline [6]. Because
not all of these sources include gene or disease terms,
we first extract the data from those that do,
namely PubChem Bioassay, CTD, ChEMBL, HuGEpedia
and Medline. We employed different approaches for
different datasets. CTD already contains
compound-disease relations, so we extract these
relationships directly. For the other datasets, the links
between compounds and diseases are indirect, and there are
two ways to mine them. Taking PubChem Bioassay as an
example: (i) we use the SQL function "position" to find gene
or disease terms from the Phenopred matrix [23] that occur in
the description of the bioassay, then use the
Phenopred matrix again to find associations between gene
and disease, which establishes the bioassay-gene-disease
link; (ii) using the GO ontology [24],
we apply the same SQL clause to find which GO
terms appear in the description of the bioassay, identify
the genes associated with each GO term on the basis of
GO annotation, then use the Phenopred matrix to find
which diseases are linked to these genes. More information
about this extraction can be found in our WENDI
paper [18].
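
The term-matching route (i) can be illustrated with a minimal Python sketch. This is not the WENDI implementation: the small `GENE_DISEASE` dictionary is a hypothetical stand-in for the Phenopred matrix, the function name is invented, and `str.find` plays the role that the SQL `position` function plays in the text.

```python
# Hypothetical stand-in for the Phenopred gene-disease associations.
GENE_DISEASE = {
    "CYP1A2": ["Lymphoma"],
    "HTR1B": ["Autistic_Disorder"],
}


def link_bioassay(description, gene_disease=GENE_DISEASE):
    """Return (gene, disease) pairs for gene terms found in a bioassay description."""
    text = description.upper()
    links = []
    for gene in sorted(gene_disease):
        # str.find mimics SQL position(): a result below 0 means "term absent".
        if text.find(gene.upper()) >= 0:
            for disease in gene_disease[gene]:
                links.append((gene, disease))
    return links


if __name__ == "__main__":
    # Bioassay name taken from the AID 410 example used later in the paper.
    print(link_bioassay("p450-cyp1a2"))
```

Plain substring matching of this kind can produce false positives when a short gene symbol happens to occur inside another word, so a production version would probably need word-boundary checks.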

We extracted the above information from the WENDI
XML using the XML DOM [25]. The information is
divided into four groups: Active-Bioassay, CTD, ChEMBL,
and Literature, which include compound, gene, disease,
bioassay, or journal information.
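
As a rough illustration of this extraction step, the sketch below uses Python's standard `xml.dom.minidom` to sort records into the four groups. The WENDI XML schema is not given in the text, so the `<record>` element, its attribute names, and the sample document are all assumptions made for the example; only the group names and the identifiers come from the paper.

```python
from xml.dom import minidom

# Invented sample document; the real WENDI XML layout is not described here.
SAMPLE = """<wendi>
  <record group="Active-Bioassay" compound="cid5486180" bioassay="aid410"
          gene="CYP1A2" disease="Lymphoma"/>
  <record group="Literature" compound="cid9681" journal="pubmedid8743744"
          gene="HTR1B" disease="Autistic_Disorder"/>
</wendi>"""

GROUPS = ("Active-Bioassay", "CTD", "ChEMBL", "Literature")


def extract_groups(xml_text):
    """Sort hypothetical <record> elements into the four groups named in the text."""
    doc = minidom.parseString(xml_text)
    grouped = {name: [] for name in GROUPS}
    for node in doc.getElementsByTagName("record"):
        # Copy the element's attributes into a plain dictionary.
        fields = dict(node.attributes.items())
        group = fields.pop("group", None)
        if group in grouped:
            grouped[group].append(fields)
    return grouped


if __name__ == "__main__":
    for name, records in extract_groups(SAMPLE).items():
        print(name, records)
```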



Data representation 

In order to provide a formal description of concepts,
terms, and relationships within the WENDI knowledge
domain and to make semantic inference possible, we
use the Web Ontology Language (OWL) to build the
CE ontology, and the Resource Description Framework
(RDF), in a variety of data interchange formats (e.g.
RDF/XML, N3, Turtle, N-Triples), to present CE data
based on the CE ontology.

The CE OWL ontology is restricted to use in our system:
it is an ontology specific to the datasets used in
CE and is not a generalized chemogenomic ontology.
Within the ontology we use the following entity classes:
Chemical Compound, BioAssay, Journal Article, Gene,
and Disease. These entities can be associated by relational
ontological terms, as shown in Table 1. The
entity and relational terms can then be combined to
express entity-relationship-entity triples, which are
suitable for representation in RDF. Some example triples are
given in Table 2.
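
To make the entity-relationship-entity idea concrete, the following sketch serializes a few such triples as N-Triples lines in plain Python. The identifiers are those of the Table 2 examples, but the `WO` namespace below is a placeholder rather than the real CE ontology IRI, and the helper name is invented for illustration.

```python
WO = "http://example.org/wo#"  # placeholder namespace, not the CE ontology IRI
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Entity-relationship-entity triples, using identifiers from the Table 2 examples.
TRIPLES = [
    ("querycmpd", "isSimilarTo", "cid24871487"),
    ("cid24871487", "rdf:type", "ChemicalCompound"),
    ("cid24871487", "isActiveIn", "aid1469"),
    ("aid1469", "hasGenes", "COL4A4"),
    ("COL4A4", "isAssociatedWith", "Nephritis"),
]


def to_ntriples(triples, base=WO):
    """Serialize (subject, predicate, object) names as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        # rdf:type lives in the RDF namespace; everything else is in `base`.
        pred = RDF_TYPE if p == "rdf:type" else base + p
        lines.append(f"<{base}{s}> <{pred}> <{base}{o}> .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_ntriples(TRIPLES))
```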

Figure 2 shows the network of possible relationships
represented by the above triples in this system.
Classes listed in Table 1 (Journal Article, Chemical
Compound, Gene, Disease, and Bioassay) are shown in yellow
ovals and instances in white ovals. Black arrows show
direct relationships mined from WENDI, and red arrows
show inferred relationships derived by CE from our
rule base, such as
"Methysergide-Autistic Disorder-HTR1B" and
"Methysergide-Lymphoma-CYP1A2".

Table 1 Examples of Object Properties and Classes for the CE Ontology

Properties        Classes                                       Explanation
isSimilarTo       Chemical Compound, Chemical Compound          Chemical Compound is similar to Chemical Compound
isActiveIn        Chemical Compound, PubChem BioAssay           Chemical is tested active in the bioassay
isContainedIn     Chemical Compound, Journal Article            Chemical is contained in the article
hasGenes          PubChem BioAssay/Drug/Journal Article, Gene   A gene term is found in the text of the Bioassay/Drug/Article;
                                                                the Bioassay/Drug/Article has a reference to the gene
hasDisease        PubChem BioAssay/Drug/Journal Article,        A disease term is found in the text of the Bioassay/Drug/Article;
                  Disease                                       the Bioassay/Drug/Article has a reference to the disease
isAssociatedWith  Gene, Disease                                 Gene and disease are associated
hasSimilarity     Chemical Compound, Similarity                 Chemical has a similarity value based on the Tanimoto coefficient

Inference and the Rule Base

Inference [26], in the context of SWT, is the automatic
discovery of new relationships from known data modelled as a
set of (named) relationships between resources, together
with a set of rules. In a mathematical sense, querying
is a form of inference (being able to infer some search
results from a mass of data, for example) [27]. We use
inference to find new relationships between
compounds and diseases.

Table 2 CE triple examples based on CE Ontology

    WO:querycmpd WO:isSimilarTo WO:cid24871487.
    WO:cid24871487 rdf:type WO:ChemicalCompound;
        WO:isActiveIn WO:aid1469.
    WO:aid1469 rdf:type WO:BioAssay;
        WO:hasGenes WO:COL4A4.
    WO:COL4A4 rdf:type WO:Gene;
        WO:isAssociatedWith WO:Nephritis.

Once the CE RDF triples are generated, they are
loaded into an OntModel [28] in Jena [29], a Java
Semantic Web platform. We apply Jena's rule-based
reasoner with forward chaining over the RDF graphs.
A rule for the rule-based reasoner is defined by a Java
Rule object with a list of body terms (premises), a list of
head terms (conclusions), an optional name, and an
optional direction. Each term or ClauseEntry is either a
triple pattern, an extended triple pattern, or a call to a
built-in primitive [29]. A total of 8 rules have been defined
in the CE system [30]; 3 of them are
listed below with explanations:

[Rule 1: (?QueryCompound WO:isSimilarTo ?CompoundID),
         (?CompoundID WO:isActiveIn ?Bioassay),
         (?Bioassay WO:isAssociatedWith ?Disease)
      -> (?QueryCompound WO:mightHasDisease ?Disease)]

Explanation: A relationship is inferred between a compound
and a disease if the query compound is similar to
another compound that is active against a PubChem
Bioassay, and that Bioassay is associated with a disease.

[Rule 2: (?CompoundID WO:isContainedIn ?Journal),
         (?Journal WO:hasGene ?Gene),
         (?Gene WO:isAssociatedWith ?Disease)
      -> (?CompoundID WO:mightHasDisease ?Disease)]

Explanation: A new compound-disease relationship is
inferred if a similar compound and a gene co-occur
in a paper abstract, and the gene and a disease co-occur
in another paper abstract.

[Rule 3: (?CompoundID WO:isActiveIn ?Bioassay),
         (?Bioassay WO:hasGene ?Gene),
         (?Gene WO:isAssociatedWith ?Disease)
      -> (?CompoundID WO:mightHasDisease ?Disease)]

Explanation: A new compound-disease relationship is
inferred if a similar compound is active against a bioassay,
the bioassay is associated with a gene, and the gene
co-occurs in a paper abstract with the disease.
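
The behaviour of the three rules listed above can be sketched with a naive forward-chaining loop in plain Python. This is only an illustration of the idea, not Jena's reasoner: the matching functions are written for this example, the ontology prefix is dropped from the property names, and the facts are the AID 410 statements shown in Table 3.

```python
def match(pattern, triple, bindings):
    """Unify one triple pattern with one triple; return extended bindings or None."""
    new = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or fail if it is already bound to something else.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def solve(body, facts, bindings=None):
    """Yield every variable binding that satisfies all body patterns."""
    bindings = bindings or {}
    if not body:
        yield bindings
        return
    for triple in facts:
        extended = match(body[0], triple, bindings)
        if extended is not None:
            yield from solve(body[1:], facts, extended)


def forward_chain(facts, rules):
    """Apply the rules repeatedly until no new triple is produced."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            # list() drains the generator before the fact set is modified.
            for b in list(solve(body, facts)):
                new = tuple(b.get(term, term) for term in head)
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts


RULES = [
    # Rule 1
    ([("?q", "isSimilarTo", "?c"), ("?c", "isActiveIn", "?a"),
      ("?a", "isAssociatedWith", "?d")], ("?q", "mightHasDisease", "?d")),
    # Rule 2
    ([("?c", "isContainedIn", "?j"), ("?j", "hasGene", "?g"),
      ("?g", "isAssociatedWith", "?d")], ("?c", "mightHasDisease", "?d")),
    # Rule 3
    ([("?c", "isActiveIn", "?a"), ("?a", "hasGene", "?g"),
      ("?g", "isAssociatedWith", "?d")], ("?c", "mightHasDisease", "?d")),
]

FACTS = [
    ("querycmpd", "isSimilarTo", "cid5486180"),
    ("cid5486180", "isActiveIn", "aid410"),
    ("aid410", "hasGene", "CYP1A2"),
    ("CYP1A2", "isAssociatedWith", "Lymphoma"),
]

if __name__ == "__main__":
    for triple in sorted(forward_chain(FACTS, RULES) - set(FACTS)):
        print(triple)
```

On these facts only Rule 3 fires. As the rule is written, the inferred mightHasDisease link is attached to the similar compound rather than to the query compound.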

We selected Methysergide [31] as an example query 
compound for the following steps. Methysergide is che- 
mically similar to LSD [32], and it antagonizes the 
effects of serotonin in blood vessels and gastrointestinal 
smooth muscle, but has few of the properties of other 
ergot alkaloids. 

Table 3 shows three sets of RDF statements for Methysergide
taken from the CE RDF network, from which we obtained
inferred evidence paths using the above rules. Each set of
statements is accompanied by an explanation in Table 3.

With Methysergide as the query compound, we obtained a total
of 63 evidence paths involving different diseases, genes, and
journal information. Individual evidence paths can be
examined to get to the root data or publications that
constitute them. For instance, the Autism link
is interesting because the publications identify
the link of the compound with HTR1B and the link of
HTR1B with Autism. LSD is known to affect the outcome
of Autism [33,34], and thus Methysergide is a reasonable
candidate for investigation.



Zhu et al. BMC Bioinformatics 2011, 12:256
http://www.biomedcentral.com/1471-2105/12/256



Figure 2 RDF Network for CE. The legend distinguishes an instance, a class, a relationship, and an inferred relationship; the example network links the query compound Methysergide, compound ID 5486180, journal paper PID "1330643", and Autistic_Disorder.



Because browsing RDF directly is difficult, we have built an
interface that allows the results to be examined and
filtered in a user-friendly fashion; the interface is
described in more detail in the next section. Specifically,
evidence paths are clustered by disease, and can be filtered
by disease, compound, assay, gene, gene family, or
journal title. Part of the results for Methysergide are
shown in Figure 3. With "Autistic_Disorder" as the filter,
two similar compounds, including Methysergide
itself, are related to "Autistic_Disorder" through HTR1B and a
journal article. The results with AID "410" as the filter
are shown in Figure 4. A total of 20 entries, associated with
different similar compounds, diseases, genes, and references,
are related to the PubChem Bioassay (AID = 410).
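
The filtering described above can be pictured as matching facet values against flat evidence-path records. The sketch below is an assumption about how such records might look, not the data model of the CE browser: the field names and helper functions are invented, while the values are taken from the Table 3 examples.

```python
from collections import defaultdict

# Evidence paths as flat records; field names are hypothetical.
PATHS = [
    {"disease": "Autistic_Disorder", "compound": "Methysergide",
     "gene": "HTR1B", "source": "pubmedid8743744"},
    {"disease": "Autistic_Disorder", "compound": "Metergoline",
     "gene": "HTR1B", "source": "pubmedid1330643"},
    {"disease": "Lymphoma", "compound": "cid5486180",
     "gene": "CYP1A2", "source": "aid410"},
]


def filter_paths(paths, **facets):
    """Keep the evidence paths whose fields equal every requested facet value."""
    return [p for p in paths if all(p.get(k) == v for k, v in facets.items())]


def cluster_by_disease(paths):
    """Group evidence paths under their disease term."""
    clusters = defaultdict(list)
    for path in paths:
        clusters[path["disease"]].append(path)
    return dict(clusters)


if __name__ == "__main__":
    for path in filter_paths(PATHS, disease="Autistic_Disorder"):
        print(path["compound"], path["gene"], path["source"])
```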

Path Ranking 

The above process often results in many evidence
paths linking compounds and diseases. With a large
number of results, we need some way to organize and
prioritize these evidence paths. We cluster all the paths
by disease term and then rank the
clusters by the number of evidence paths they contain.
Whilst evidence paths are not necessarily fully
independent, they do constitute different collections of
evidence for the same relationship, and thus strengthen
the chances of the relationship being significant.

We employ the following SPARQL query to
implement this ranking process over the inferred
RDF. It first counts the number of properties (?pc) related
to each disease, and then returns the disease terms (?dis)
in descending order of ?pc.

SELECT ?dis (COUNT(?p) AS ?pc)
WHERE { ?dis a wo:Disease ; ?p ?o }
GROUP BY ?dis
ORDER BY DESC(?pc)
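
For readers less familiar with SPARQL aggregates, the same counting logic can be written out in plain Python. This is a sketch under the assumption that the inferred graph is available as (subject, predicate, object) tuples with prefixed names. The function name and the sample triples are illustrative only, and the alphabetical tie-break is an addition that the query itself does not specify.

```python
from collections import Counter


def rank_diseases(triples, disease_class="wo:Disease"):
    """Count the triples whose subject is typed as a Disease, highest count first."""
    triples = set(triples)  # an RDF graph holds each triple at most once
    diseases = {s for s, p, o in triples if p == "rdf:type" and o == disease_class}
    # Like the query's "?dis ?p ?o" pattern, this also counts the rdf:type triple.
    counts = Counter(s for s, p, o in triples if s in diseases)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


TRIPLES = [
    ("wo:Autistic_Disorder", "rdf:type", "wo:Disease"),
    ("wo:Autistic_Disorder", "wo:isAssociatedWith", "wo:HTR1B"),
    ("wo:Autistic_Disorder", "wo:isInferredFrom", "pubmedid19038234"),
    ("wo:Lymphoma", "rdf:type", "wo:Disease"),
    ("wo:HTR1B", "rdf:type", "wo:Gene"),
]

if __name__ == "__main__":
    for disease, count in rank_diseases(TRIPLES):
        print(disease, count)
```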

Table 3 CE RDF statement examples

RDF Statements:

    wo:ctdcid9681 rdf:type wo:ChemicalCompound.
    wo:querycmpd wo:isSimilarTo wo:ctdcid9681;
        wo:hasSimilarity "1.000".
    wo:ctdcid9681 wo:hasName "Methysergide".
    wo:HTR1B rdf:type wo:Gene;
        wo:isrelatedTo wo:cid9681;
        wo:isInferredFrom "pubmedid8743744".
    wo:Autistic_Disorder rdf:type wo:Disease;
        wo:isAssociatedWith wo:HTR1B;
        wo:isInferredFrom "pubmedid19038234".

Explanation: A Methysergide-Autistic Disorder relationship is inferred via rule 2 (gene HTR1B). The similar
compound (cid = 9681) is Methysergide itself, with similarity = 1; it co-occurs with gene
HTR1B in one paper (pubmed id = 8743744), and HTR1B and Autistic_Disorder
co-occur in another paper (pubmed id = 19038234). Rule 2 then establishes the relationship.

RDF Statements:

    wo:ctdcid11865408 rdf:type wo:ChemicalCompound.
    wo:querycmpd wo:isSimilarTo wo:ctdcid11865408;
        wo:hasSimilarity "0.774".
    wo:ctdcid11865408 wo:hasName "Metergoline".
    wo:HTR1B rdf:type wo:Gene;
        wo:isrelatedTo wo:cid11865408;
        wo:isInferredFrom "pubmedid1330643".
    wo:Autistic_Disorder rdf:type wo:Disease;
        wo:isAssociatedWith wo:HTR1B;
        wo:isInferredFrom "pubmedid19038234".

Explanation: A Methysergide-Autistic Disorder relationship is also inferred via rule 2 (again via HTR1B).
Although this is the same relationship, a different evidence path is considered (these evidence
paths are ranked later).

RDF Statements:

    wo:cid5486180 rdf:type wo:ChemicalCompound.
    wo:querycmpd wo:isSimilarTo wo:cid5486180;
        wo:hasSimilarity "0.929".
    wo:cid5486180 wo:isActiveIn wo:aid410.
    wo:aid410 rdf:type wo:BioAssay;
        wo:hasName "p450-cyp1a2".
    wo:CYP1A2 rdf:type wo:Gene.
    wo:aid410 wo:hasGene wo:CYP1A2.
    wo:CYP1A2 wo:isAssociatedWith wo:Lymphoma.
    wo:Lymphoma rdf:type wo:Disease.

Explanation: A Methysergide-Lymphoma relationship is inferred by rule 1 (via CYP1A2).

Figure 3 Results related to "Autistic_Disorder" for Methysergide shown in the CE Faceted Browser by using the Disease filter.

Figure 4 Results related to AID "410" for Methysergide shown in the CE Faceted Browser by using the AID filter.

Results and discussion

The architecture for CE is shown in Figure 5. CE performs
data retrieval, data processing, and data visualization.
When a query compound is submitted to the "Data Controller",
a servlet that mediates between client and server, the Data
Controller sends the request to the WENDI web service.
The resulting WENDI XML is then passed to the "RDF Model
Builder", which handles CE ontology generation, RDF
conversion, RDF inference, and path ranking. Ranked
paths are sent back to the Data Controller, which converts
them to an RDF-based JSON file for visualization in the
Faceted Browser. The SPARQL Query Builder is an additional
CE user platform for making customized SPARQL
queries against the CE RDF sent back from the Data
Controller.

Faceted Browser for CE 

CE provides a main web interface, shown in Figure 6. In
the figure, Methysergide is drawn using the JME molecular
editor, and its SMILES is transferred to the input
box. The results are displayed in the Faceted
Browser, which is based on an existing tool [35] and allows
multiple filters to be applied.

SPARQL Query Builder for CE 

After XML to RDF conversion, CE has RDF triples
based on the CE ontology. We therefore saw the utility of
allowing direct querying of this RDF data. Since
SPARQL is a complex language, we implemented a
SPARQL Query Builder to semi-automate this process.
The SPARQL Query Builder for CE is built on the
Sesame triple store [36]. The interface is shown in
Figure 7. Starting with a class, the user can add data
and object properties associated with it through
prompted drop-down boxes. Step by step, the SPARQL
Query Builder provides an intuitive way to translate a user
question into a graph pattern, and then encode it as a
SPARQL query.

As an example, given the relationship of Methysergide
[31] with HTR1B, LSD and Autism discussed above,
we can explore the relationship of similar compounds
with the serotonin 5-HT1B receptor (the LSD receptor)
using a SPARQL query. We build the SPARQL
query in the builder with the following 2 steps to retrieve
journal papers that include information about the "5-HT"
receptor:

1) Find compounds similar to Methysergide in the
literature,

Subject: wo:ChemicalCompound
Predicate: wo:isContainedIn
Object: wo:JournalArticle

2) The titles of the papers should include "5-HT",

Subject: wo:JournalArticle
Predicate: wo:hasTitle
Object: "5-HT"
Figure 5 CE Architecture and Path Ranking Flowchart



Figure 6 CE main Web Interface. The page text reads: "Chemogenomic Explorer automatically gathers evidence relating compounds to genes and diseases using output from WENDI, which is the product of a collaboration between the cheminformatics group at Indiana University, School of Informatics & Computing, and Eli Lilly & Company."

Figure 7 Main Web Interface of CE SPARQL Query Builder. The query generated in the interface is:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX wo: <http://www.cheminfo.informatics.indiana.edu/WendiOntology.owl#>
SELECT DISTINCT ?JournalArticle1_hasTitle
WHERE {
  ?ChemicalCompound1 rdf:type wo:ChemicalCompound .
  ?ChemicalCompound1 wo:isContainedIn ?JournalArticle1 .
  ?JournalArticle1 rdf:type wo:JournalArticle .
  ?JournalArticle1 wo:hasTitle ?JournalArticle1_hasTitle .
  FILTER REGEX(str(?JournalArticle1_hasTitle), '5-HT', 'i')
}



The implementation of this query is shown in
Figure 7, and the journal titles containing "5-HT" that are
returned on the result page of the SPARQL Query Builder
are listed in Table 4.

Identification of potential gene targets and diseases for 
Clozapine 

In order to validate CE, we tested it with well-known
drugs as queries, to see how the ordering of the clusters
of evidence paths relates to known uses and side
effects of these drugs. For example, Clozapine [37] has
been shown to have superior efficacy compared
to olanzapine [38] in the treatment of schizophrenia
[39]. Known side effects of Clozapine include
cardiomyopathy (deterioration of the function of the
myocardium) and cardiac hypertrophy. For this drug, CE
indeed predominantly returns compound-disease paths
that relate to schizophrenia (i.e. schizophrenia has
more evidence paths than any other disease). It also
correctly identifies side effects of the drug, namely
hypertrophic cardiomyopathy and cardiovascular system
disease, both of which are supported by the literature
[40,41]. This is shown in Figure 8.



Exploring newly submitted compounds from PubChem 

PubChem is a popular public database of chemical compounds
and their activities in biological assays.
Since CE is designed for use with "new" compounds as
queries (i.e. compounds for which little
data is available), we chose a set of very recently added
compounds in PubChem with little or no associated
bioactivity information recorded. This was done
using a constrained search in PubChem [42] to return
only compounds submitted in 2011.

For example, as shown in Figure 9, the compound
with CID 49835692 [43] has no associated bioactivity
data recorded. However, through its analysis of similar
structures, CE suggests some significant potential
bioactivities and disease associations.

We used the RDF network to infer associations
between compounds and diseases. As the experiments
discussed above show, the approach not only identifies the
diseases most strongly related to a compound, but also
provides general guidance for the analysis of new compounds.
The methodology retrieves pertinent information in a
particular domain despite the difficulties created by the data
tsunami. In addition, it expands the possible uses and
linkages that can be drawn from the limited volume of
disease information available for a specific compound.

Table 4 List of Journal Titles including "5-HT" receptor

Journal Titles

First Pharmacophoric Hypothesis for 5-HT7 Antagonism

Novel, Potent, and Selective 5-HT3 Receptor Antagonists Based on the Arylpiperazine Skeleton: Synthesis, Structure, Biological Activity, and Comparative Molecular Field Analysis Studies

Synthesis of 2-Piperazinylbenzothiazole and 2-Piperazinylbenzoxazole Derivatives with 5-HT3 Antagonist and 5-HT4 Agonist Properties

Novel and Highly Potent 5-HT3 Receptor Agonists Based on a Pyrroloquinoxaline Structure

Figure 8 CE results for Clozapine

Figure 9 More Chemogenomic information for New Compound from PubChem

Conclusions 

We present a new approach to searching for associations
between chemical compounds and diseases using semantic
inference. Semantic inference produces evidence
paths relating compounds and diseases via genes,
publications, bioassays and drugs. We previously released an
aggregative data-mining tool, WENDI, for drug discovery
using aggregate web services. In this paper, we have
shown how the application of Semantic Web methods
(RDF, SPARQL and OWL ontologies), along with rule-based
inference, path ranking and a faceted browser, can
produce a tool for exploring new compound-disease
associations based on evidence paths from WENDI.

Future work 

The current version of CE explores the chemogenomic
information of chemical compounds. In the future, we
will consider more efficient ways to mine compound-gene
and compound-disease links from more chemogenomic
data. We plan to aggregate additional data and
inference rules, and to increase collaboration with
Chem2Bio2RDF in order to enable CE to link with more
diverse data. We also intend to expand beyond
Chem2Bio2RDF to chemical biology, where we can consider
other relations such as chemical-gene, chemical-pathway,
and chemical-side effect. In addition, we would like to
add the functionality to process batches of molecules.
For this case, we will consider the issues of information
summarization and visualization, i.e. how to organize
more data in a readable way. Because of the increased
volume of data and results, some current algorithms will
become inadequate. We will also consider other ranking
algorithms, such as ones based on evidence importance.